Improved robots.txt handling#46

Closed
jakekdodd wants to merge 2 commits into master from improved-robots

Conversation

@jakekdodd
Contributor

This PR adds a new mechanism for robots.txt caching and filtering.

Among the changes are:

  • A new thread-safe, in-memory singleton cache for robots rules
  • A RobotsURLFilter
  • Moves the getCacheKey() method to the cache implementation because it's used in several locations
  • The ParserBolt ensures that the robots rules are loaded into the shared cache, which means conditionally refetching and parsing the domain's robots.txt file
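
A minimal sketch of the thread-safe, in-memory singleton cache described above, including a getCacheKey() that derives one key per protocol/host/port. Everything beyond the getCacheKey() name is illustrative (the PR's real cache stores parsed robots rules, not strings):

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a JVM-wide, thread-safe robots-rules cache.
// The rules are simplified to Strings here; the real implementation
// would store parsed robots rules objects.
public class RobotsCache {

    private static final RobotsCache INSTANCE = new RobotsCache();

    // ConcurrentHashMap gives lock-free reads and fine-grained write locking.
    private final Map<String, String> rules = new ConcurrentHashMap<>();

    private RobotsCache() {}

    public static RobotsCache getInstance() {
        return INSTANCE;
    }

    // One key per protocol://host:port, since robots.txt is scoped to the host.
    public static String getCacheKey(URI uri) {
        int port = uri.getPort();
        if (port == -1) {
            port = "https".equals(uri.getScheme()) ? 443 : 80;
        }
        return uri.getScheme() + "://" + uri.getHost() + ":" + port;
    }

    public void put(String key, String parsedRules) {
        rules.put(key, parsedRules);
    }

    public String get(String key) {
        return rules.get(key);
    }
}
```

Because both the fetching and filtering code need the same key derivation, putting getCacheKey() on the cache (rather than on one bolt) keeps every caller consistent.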

Jake Dodd added 2 commits December 17, 2014 08:45
This commit adds an interface for robots caches, an in-memory,
thread-safe cache implementation, and a basic URLFilter for robots
rules.
@jakekdodd
Contributor Author

A few notes on the new PR:

  • I decided against overloading the URLFilters.filter() method to include metadata, since it wouldn't help us here. Because the metadata is a Map<String, String[]>, the robots rules would need to be serialized as a string array.
  • The getCacheKey() method has been moved to the cache implementation, since its results are needed in several classes
  • You won't see any explicit behavior in the ParserBolt that conditionally refetches robots.txt, but it's there. The method protocol.getRobotRules() has non-explicit side effects: it conditionally fetches and parses the robots.txt file on a cache miss. Eventually, this should be reworked, because HTTP requests plus hitting a concurrent cache can have performance impacts on a topology that might be difficult to debug.
  • There are several ways this could be improved, but it works; I'm hesitant to let 'better' become the enemy of 'good enough' :)
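
To make the filter's cache-miss behavior concrete, here is a hedged sketch of a robots URL filter that only reads the cache and lets URLs through on a miss. The class shape and rule representation are illustrative, not the PR's actual code; StormCrawler's URL filter convention of returning null to discard a URL is assumed:

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Illustrative robots URL filter: look up cached rules for the URL's host
// and drop the URL if the rules disallow its path.
public class RobotsURLFilter {

    // cache key -> predicate returning true when a path is allowed
    private final Map<String, Predicate<String>> cache = new ConcurrentHashMap<>();

    public void addRules(String cacheKey, Predicate<String> isAllowed) {
        cache.put(cacheKey, isAllowed);
    }

    // Returns the URL unchanged if allowed, or null to filter it out.
    public String filter(String url) {
        URI uri = URI.create(url);
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        String key = uri.getScheme() + "://" + uri.getHost() + ":" + port;
        Predicate<String> rules = cache.get(key);
        // On a cache miss, let the URL through: fetching robots.txt from
        // inside the filter is exactly the side effect being avoided here.
        if (rules == null || rules.test(uri.getPath())) {
            return url;
        }
        return null;
    }
}
```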

@jnioche
Contributor

jnioche commented Dec 23, 2014

Hi Jake. I don't really like the changes to the ParserBolt. Either the FetcherBolt stores the robots info via the metadata (the value can be an array containing a single string), since it is already available there, or the URL filter itself fetches the robots file; but that should definitely not be done by the ParserBolt.
This is not a hugely important functionality, maybe we could leave it aside for now if we can't find a nice way of doing it.

@jakekdodd
Contributor Author

I'm not a fan of the implementation either, but there's a reason for fetching the robots file in the ParserBolt rather than within the filter.

Making the call to protocol.getRobotRules(url) within the filter itself will be seriously problematic when the outlinks have been gathered from an aggregator page (Drudge Report is a good example; some pages would be even worse). Filtering those outlinks would result in a flurry of several hundred HTTP requests, because protocol.getRobotRules() automatically fetches the robots rules if there's a cache miss.

Other solutions I entertained are:

  1. Fetching within the filter, but having the filter behave differently depending on whether the link has the same hostname as the source URL. This requires passing the source URL to the filter. The logic would call protocol.getRobotRules() for links with the same hostname as the source URL, and a direct cache query for all others.

  2. Requiring that robots.txt be included as a string in the metadata, as we previously discussed. But in this scenario, the most efficient way to make sure the cache is loaded would still be to parse robots.txt in the ParserBolt, rather than to give the robots.txt string to the filter, check for a cache hit, and conditionally parse the string within the filter itself.

(2) is still an option, but the only fundamental difference is that the ParserBolt doesn't make an HTTP request for robots.txt.
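
Option (1)'s hostname-based dispatch can be sketched as follows. The RulesSource interface and all names here are hypothetical stand-ins for the protocol/cache calls described above:

```java
import java.net.URI;

// Illustrative sketch of option (1): fetch-on-miss only for same-host
// outlinks; cache-only lookup (never a fetch) for external outlinks.
public class OutlinkRobotsCheck {

    public interface RulesSource {
        boolean fetchAndCheck(String url);   // may trigger an HTTP fetch on a miss
        Boolean cachedCheck(String url);     // cache-only; null on a miss
    }

    public static boolean isAllowed(String sourceUrl, String outlink, RulesSource rules) {
        String sourceHost = URI.create(sourceUrl).getHost();
        String outlinkHost = URI.create(outlink).getHost();
        if (sourceHost.equals(outlinkHost)) {
            // Same host as the page being parsed: at most one fetch per page.
            return rules.fetchAndCheck(outlink);
        }
        // External host: only filter if the rules happen to be cached already,
        // avoiding a flurry of HTTP requests for aggregator pages.
        Boolean cached = rules.cachedCheck(outlink);
        return cached == null || cached;
    }
}
```

The key property is that the number of potential HTTP requests per parsed page is bounded by one, regardless of how many external outlinks the page contains.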

@jnioche
Contributor

jnioche commented Jan 5, 2015

It would make sense to pass the source URL as metadata anyway (great for debugging) and that could indeed be used to determine whether or not to do the sitemap checks.

I am planning to add a utility class that would help generate metadata for outlinks given the metadata for the source page. This would be configurable of course and could be used to add the 'origin' metadata.

> The logic would call protocol.getRobotRules() for links with the same hostname as the source URL, and a direct cache query for all others.

I presume that 'direct cache query' means hitting the same cache instance as used by the Fetcher, which in the case of the memory-based one would work if it is located in the same JVM. If so, why not do that when the hostname is the same as the source as well? It should definitely be there.

@jakekdodd
Contributor Author

> I presume that 'direct cache query' means hitting the same cache instance as used by the Fetcher, which in the case of the memory-based one would work if it is located in the same JVM. If so, why not do that when the hostname is the same as the source as well? It should definitely be there.

The assumption is that many or most outlinks will be internal, so we would guarantee that the robots rules would be available for outlinks with the same hostname as the source URL; for external outlinks, the robots rules might be available for filtering, depending on a number of factors.

To guarantee that robots rules for the hostname are in the JVM-local cache, we need to either fetch or re-parse the robots.txt file within the same process as the filters. For external outlinks, the rules might be present in the cache; if they are, then this is a secondary benefit.

Keeping the filter straightforward (give it a cache, feed it URLs to filter) leaves the cache-filling strategy outside of the filter, which is where I think it should be.

@jnioche
Contributor

jnioche commented Jan 6, 2015

I agree with your definition of what the filter does (i.e. guaranteed for same hostname).

Most of the time the fetcher and parser will be running on the same JVM - if not the performance of the crawler would certainly suffer a lot. I'd be happy with a simpler solution where we expect this to be the case.

If we want a 100% guarantee then indeed we need to check the cache and, if not found there, refetch and parse the robots.txt. What I think we disagree on is where it should be done. I don't think it should be done in the parser at all: its purpose is to be generic, and moreover, people could have various implementations of parsing bolts (Gui wrote one, for instance, to deal with HTML exclusively). We'd want them to be able to reuse the parsing filters as-is without having to write any bespoke code that would be a prerequisite for any of these filters to work. If we need to fetch+parse the robots rules then it should be done within the filter itself.

Again, there would be something wrong in the users' setup if the fetcher and parsers were not running in the same JVM. I think it is reasonable to just reuse the cache filled by the Fetcher.
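
The "reuse the Fetcher's cache" approach boils down to a single JVM-wide map that the fetcher writes and the filter only reads. A minimal sketch, with illustrative names (the real cache would hold parsed rules, not booleans):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: fetcher and filter in the same worker JVM share one cache,
// so the filter never issues an HTTP request itself.
public class SharedRobotsCacheDemo {

    // One JVM-wide cache, since both bolts run in the same worker process.
    static final Map<String, Boolean> SHARED = new ConcurrentHashMap<>();

    // FetcherBolt side: populates the cache when robots.txt is fetched.
    static void fetcherStoresRules(String hostKey, boolean allowAll) {
        SHARED.put(hostKey, allowAll);
    }

    // Filter side: read-only; a miss means "let the URL through".
    static boolean filterAllows(String hostKey) {
        return SHARED.getOrDefault(hostKey, Boolean.TRUE);
    }
}
```

The trade-off, as discussed above, is that the guarantee only holds when both bolts share a JVM; in a topology where they don't, the filter degrades to letting everything through.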

@jakekdodd
Contributor Author

Let's close this PR, and revisit when #36 and #53 are closed. We'll keep the related issue (#24) open.

@jakekdodd jakekdodd closed this Jan 14, 2015
@jnioche jnioche deleted the improved-robots branch October 20, 2015 14:22